PGH Transit Atlas - EDA Notebook¶

Rizaldy Utomo | Public Policy, Analytics, AI Management @ CMU | rutomo@andrew.cmu.edu

🚌 🚴 Interactive Dashboard: https://rzrizaldy.github.io/pgh-transit-atlas/


This notebook performs full ETL + EDA on Pittsburgh's POGOH bikeshare and PRT bus data.

Data Sources:

  • POGOH trip data (Nov 2024 - Oct 2025) - 12 Excel files
  • POGOH station locations
  • PRT bus stop usage
In [1]:
# Import libraries
import pandas as pd
import numpy as np
import json
from pathlib import Path
from datetime import datetime
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

import matplotlib.pyplot as plt
import seaborn as sns

import folium
from folium.plugins import MarkerCluster

import warnings
warnings.filterwarnings('ignore')

# Config for inline display
%matplotlib inline
sns.set_style("whitegrid")
plt.rcParams['figure.figsize'] = (12, 6)

print("✓ Libraries imported")
✓ Libraries imported

1. Data Ingestion¶

In [2]:
# Load station metadata
stations = pd.read_excel('dataset/pogoh-station-locations-october-2025.xlsx')
print(f"Loaded {len(stations)} stations")
stations.head()
Loaded 60 stations
Out[2]:
Id Name Total Docks Latitude Longitude
0 1 Pierce St & Summerlea St 19 40.456507 -79.932331
1 2 Eliza Furnace Trail & Swineburne St 15 40.425800 -79.953400
2 3 Centre Ave & Addison St 15 40.444600 -79.978300
3 4 Burns White Center at 3 Crossings 15 40.456400 -79.980000
4 5 Allegheny Station 19 40.448301 -80.018007
In [3]:
# Load 12 months of trip data
trip_files = [
    'november-2024.xlsx', 'december-2024.xlsx', 'january-2025.xlsx',
    'february-2025.xlsx', 'march-2025.xlsx', 'april-2025.xlsx',
    'may-2025.xlsx', 'june-2025.xlsx', 'july-2025.xlsx',
    'august-2025.xlsx', 'september-2025.xlsx', 'pogoh-october-2025.xlsx'
]

trips_list = []
for file in trip_files:
    df = pd.read_excel(f'dataset/{file}')
    trips_list.append(df)
    print(f"{file}: {len(df):,} trips")

trips = pd.concat(trips_list, ignore_index=True)
print(f"\nTotal: {len(trips):,} trips")
november-2024.xlsx: 43,436 trips
december-2024.xlsx: 19,739 trips
january-2025.xlsx: 15,940 trips
february-2025.xlsx: 23,571 trips
march-2025.xlsx: 35,940 trips
april-2025.xlsx: 47,523 trips
may-2025.xlsx: 32,922 trips
june-2025.xlsx: 36,345 trips
july-2025.xlsx: 43,958 trips
august-2025.xlsx: 65,153 trips
september-2025.xlsx: 101,198 trips
pogoh-october-2025.xlsx: 91,817 trips

Total: 557,542 trips
In [4]:
# Check data structure
trips.head(10)
Out[4]:
Closed Status Duration Start Station Id Start Date Start Station Name End Date End Station Id End Station Name Rider Type
0 NORMAL 253 28 2024-11-30 23:51:33 Fifth Ave & S Bouquet St 2024-11-30 23:55:46 20.0 Boulevard of the Allies & Parkview Ave MEMBER
1 NORMAL 1368 59 2024-11-30 23:49:22 Forbes Ave at TCS Hall (CMU Campus) 2024-12-01 00:12:10 59.0 Forbes Ave at TCS Hall (CMU Campus) MEMBER
2 NORMAL 1394 59 2024-11-30 23:49:12 Forbes Ave at TCS Hall (CMU Campus) 2024-12-01 00:12:26 59.0 Forbes Ave at TCS Hall (CMU Campus) MEMBER
3 NORMAL 198 12 2024-11-30 23:47:49 O'Hara St and University Place 2024-11-30 23:51:07 34.0 N Dithridge St & Centre Ave MEMBER
4 NORMAL 210 28 2024-11-30 23:45:32 Fifth Ave & S Bouquet St 2024-11-30 23:49:02 20.0 Boulevard of the Allies & Parkview Ave MEMBER
5 NORMAL 281 51 2024-11-30 23:38:04 Coltart Ave & Forbes Ave 2024-11-30 23:42:45 20.0 Boulevard of the Allies & Parkview Ave MEMBER
6 NORMAL 323 37 2024-11-30 23:31:23 S Negley Ave & Centre Ave 2024-11-30 23:36:46 30.0 Shady Ave & Ellsworth Ave MEMBER
7 NORMAL 1372 28 2024-11-30 23:25:31 Fifth Ave & S Bouquet St 2024-11-30 23:48:23 59.0 Forbes Ave at TCS Hall (CMU Campus) MEMBER
8 NORMAL 1323 28 2024-11-30 23:25:19 Fifth Ave & S Bouquet St 2024-11-30 23:47:22 59.0 Forbes Ave at TCS Hall (CMU Campus) MEMBER
9 NORMAL 201 13 2024-11-30 23:21:02 S Bouquet Ave & Sennott St 2024-11-30 23:24:23 28.0 Fifth Ave & S Bouquet St MEMBER
In [5]:
trips.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 557542 entries, 0 to 557541
Data columns (total 9 columns):
 #   Column              Non-Null Count   Dtype         
---  ------              --------------   -----         
 0   Closed Status       557542 non-null  object        
 1   Duration            557542 non-null  int64         
 2   Start Station Id    557542 non-null  int64         
 3   Start Date          557542 non-null  datetime64[ns]
 4   Start Station Name  557542 non-null  object        
 5   End Date            557542 non-null  datetime64[ns]
 6   End Station Id      557440 non-null  float64       
 7   End Station Name    557440 non-null  object        
 8   Rider Type          557542 non-null  object        
dtypes: datetime64[ns](2), float64(1), int64(2), object(4)
memory usage: 38.3+ MB

2. Data Cleaning & Feature Engineering¶

In [6]:
# Parse dates and filter outliers
trips['Start Date'] = pd.to_datetime(trips['Start Date'])
trips['End Date'] = pd.to_datetime(trips['End Date'])

initial = len(trips)
trips = trips[trips['Duration'] <= 14400]  # Filter > 4 hours
print(f"Filtered {initial - len(trips):,} outliers (>4hrs)")
print(f"Remaining: {len(trips):,} trips")
Filtered 1,027 outliers (>4hrs)
Remaining: 556,515 trips
In [7]:
# Add temporal features
trips['hour'] = trips['Start Date'].dt.hour
trips['day_of_week'] = trips['Start Date'].dt.dayofweek
trips['day_name'] = trips['Start Date'].dt.day_name()
trips['date'] = trips['Start Date'].dt.date

trips[['Start Date', 'hour', 'day_name']].head(10)
Out[7]:
Start Date hour day_name
0 2024-11-30 23:51:33 23 Saturday
1 2024-11-30 23:49:22 23 Saturday
2 2024-11-30 23:49:12 23 Saturday
3 2024-11-30 23:47:49 23 Saturday
4 2024-11-30 23:45:32 23 Saturday
5 2024-11-30 23:38:04 23 Saturday
6 2024-11-30 23:31:23 23 Saturday
7 2024-11-30 23:25:31 23 Saturday
8 2024-11-30 23:25:19 23 Saturday
9 2024-11-30 23:21:02 23 Saturday
In [8]:
# Merge coordinates
trips = trips.merge(
    stations[['Id', 'Latitude', 'Longitude']],
    left_on='Start Station Id', right_on='Id', how='left'
).rename(columns={'Latitude': 'start_lat', 'Longitude': 'start_lon'})

trips = trips.merge(
    stations[['Id', 'Latitude', 'Longitude']],
    left_on='End Station Id', right_on='Id', how='left', suffixes=('', '_end')
).rename(columns={'Latitude': 'end_lat', 'Longitude': 'end_lon'})

print("✓ Merged station coordinates")
trips[['Start Station Name', 'start_lat', 'start_lon']].head()
✓ Merged station coordinates
Out[8]:
Start Station Name start_lat start_lon
0 Fifth Ave & S Bouquet St 40.442200 -79.957600
1 Forbes Ave at TCS Hall (CMU Campus) 40.444654 -79.947535
2 Forbes Ave at TCS Hall (CMU Campus) 40.444654 -79.947535
3 O'Hara St and University Place 40.445117 -79.957118
4 Fifth Ave & S Bouquet St 40.442200 -79.957600
In [9]:
# Calculate Haversine displacement
def haversine_distance(lat1, lon1, lat2, lon2):
    R = 6371000
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

trips['displacement'] = haversine_distance(
    trips['start_lat'], trips['start_lon'],
    trips['end_lat'], trips['end_lon']
)

print("✓ Calculated displacement")
trips[['Duration', 'displacement']].describe()
✓ Calculated displacement
Out[9]:
Duration displacement
count 556515.000000 556437.000000
mean 705.091241 1252.059636
std 953.957185 1131.984984
min 0.000000 0.000000
25% 229.000000 516.520501
50% 397.000000 868.728209
75% 824.000000 1655.615969
max 14389.000000 10776.503161
In [10]:
# Flag Campus Corridor trips
CAMPUS_LAT_MIN, CAMPUS_LAT_MAX = 40.435, 40.450
CAMPUS_LON_MIN, CAMPUS_LON_MAX = -79.970, -79.940

trips['is_campus'] = (
    ((trips['start_lat'] >= CAMPUS_LAT_MIN) & (trips['start_lat'] <= CAMPUS_LAT_MAX) &
     (trips['start_lon'] >= CAMPUS_LON_MIN) & (trips['start_lon'] <= CAMPUS_LON_MAX)) |
    ((trips['end_lat'] >= CAMPUS_LAT_MIN) & (trips['end_lat'] <= CAMPUS_LAT_MAX) &
     (trips['end_lon'] >= CAMPUS_LON_MIN) & (trips['end_lon'] <= CAMPUS_LON_MAX))
)

campus_pct = trips['is_campus'].sum() / len(trips) * 100
print(f"Campus Corridor: {trips['is_campus'].sum():,} ({campus_pct:.1f}%)")
print(f"City-wide: {(~trips['is_campus']).sum():,} ({100-campus_pct:.1f}%)")
Campus Corridor: 378,811 (68.1%)
City-wide: 177,704 (31.9%)

3. K-Means Clustering: Trip Archetypes¶

Features: Duration, Displacement, Hour (normalized)

In [11]:
# Prepare features
trips_clean = trips.dropna(subset=['displacement', 'Duration', 'hour'])
features = trips_clean[['Duration', 'displacement', 'hour']].values
print(f"Feature matrix: {features.shape}")

# Normalize
scaler = StandardScaler()
features_scaled = scaler.fit_transform(features)
print(f"Duration: mean={scaler.mean_[0]:.1f}s, std={scaler.scale_[0]:.1f}")
print(f"Displacement: mean={scaler.mean_[1]:.1f}m")
Feature matrix: (556437, 3)
Duration: mean=705.0s, std=953.7
Displacement: mean=1252.1m
In [12]:
# K-Means (k=4)
kmeans = KMeans(n_clusters=4, random_state=42, n_init=10)
trips_clean['archetype'] = kmeans.fit_predict(features_scaled)

print(f"✓ K-Means inertia: {kmeans.inertia_:.1f}")
print("\nCluster sizes:")
print(trips_clean['archetype'].value_counts().sort_index())
✓ K-Means inertia: 647166.6

Cluster sizes:
archetype
0    266603
1     87128
2    182663
3     20043
Name: count, dtype: int64
In [13]:
# Analyze centroids and label
archetype_labels = []
for i in range(4):
    cluster = trips_clean[trips_clean['archetype'] == i]
    avg_dur = cluster['Duration'].mean()
    avg_dist = cluster['displacement'].mean()
    avg_hour = cluster['hour'].mean()
    weekend_pct = (cluster['day_of_week'] >= 5).mean() * 100
    
    # Label logic
    if (avg_hour >= 7 and avg_hour <= 9) or (avg_hour >= 17 and avg_hour <= 19):
        label = "Commuter"
    elif weekend_pct > 40:
        label = "Leisure"
    elif avg_dur < 600:
        label = "Last-Mile"
    else:
        label = "Errand"
    
    archetype_labels.append({
        'cluster': i, 'label': label, 'count': len(cluster),
        'avg_duration': avg_dur, 'avg_displacement': avg_dist,
        'avg_hour': avg_hour, 'weekend_pct': weekend_pct
    })
    
    print(f"Cluster {i} → {label}")
    print(f"  {len(cluster):,} trips ({len(cluster)/len(trips_clean)*100:.1f}%)")
    print(f"  Dur: {avg_dur/60:.1f}min, Dist: {avg_dist:.0f}m, Hour: {avg_hour:.1f}\n")
Cluster 0 → Commuter
  266,603 trips (47.9%)
  Dur: 7.7min, Dist: 836m, Hour: 17.7

Cluster 1 → Errand
  87,128 trips (15.7%)
  Dur: 20.1min, Dist: 3359m, Hour: 14.8

Cluster 2 → Last-Mile
  182,663 trips (32.8%)
  Dur: 7.0min, Dist: 910m, Hour: 9.3

Cluster 3 → Leisure
  20,043 trips (3.6%)
  Dur: 73.2min, Dist: 737m, Hour: 14.4

In [14]:
# Map labels
archetype_map = {a['cluster']: a['label'] for a in archetype_labels}
trips_clean['archetype_label'] = trips_clean['archetype'].map(archetype_map)

trips_clean[['Start Station Name', 'Duration', 'displacement', 'archetype_label']].head(10)
Out[14]:
Start Station Name Duration displacement archetype_label
0 Fifth Ave & S Bouquet St 253 992.447135 Commuter
1 Forbes Ave at TCS Hall (CMU Campus) 1368 0.000000 Commuter
2 Forbes Ave at TCS Hall (CMU Campus) 1394 0.000000 Commuter
3 O'Hara St and University Place 198 769.589720 Commuter
4 Fifth Ave & S Bouquet St 210 992.447135 Commuter
5 Coltart Ave & Forbes Ave 281 876.021030 Commuter
6 S Negley Ave & Centre Ave 323 1063.997498 Commuter
7 Fifth Ave & S Bouquet St 1372 894.352617 Commuter
8 Fifth Ave & S Bouquet St 1323 894.352617 Commuter
9 S Bouquet Ave & Sennott St 201 167.470362 Commuter

4. Temporal Analysis¶

In [15]:
# Daily timeseries
daily_pogoh = trips.groupby('date').size().reset_index(name='trips')
daily_campus = trips[trips['is_campus']].groupby('date').size().reset_index(name='campus_trips')
daily_city = trips[~trips['is_campus']].groupby('date').size().reset_index(name='city_trips')

daily_full = daily_pogoh.merge(daily_campus, on='date', how='left').merge(daily_city, on='date', how='left').fillna(0)
daily_full['date'] = pd.to_datetime(daily_full['date'])

print(f"{len(daily_full)} days")
daily_full.head(10)
365 days
Out[15]:
date trips campus_trips city_trips
0 2024-11-01 2391 1798 593
1 2024-11-02 1851 1216 635
2 2024-11-03 1778 1251 527
3 2024-11-04 2206 1801 405
4 2024-11-05 2219 1621 598
5 2024-11-06 2240 1780 460
6 2024-11-07 2457 1950 507
7 2024-11-08 2440 1903 537
8 2024-11-09 1778 1134 644
9 2024-11-10 719 565 154
In [16]:
# Peak/trough
peak = daily_full.loc[daily_full['trips'].idxmax()]
trough = daily_full.loc[daily_full['trips'].idxmin()]
print(f"Peak: {peak['date'].strftime('%b %d')} with {int(peak['trips']):,} trips")
print(f"Trough: {trough['date'].strftime('%b %d')} with {int(trough['trips']):,} trips")
print(f"Variance: {peak['trips']/trough['trips']:.1f}x")
Peak: Sep 19 with 3,892 trips
Trough: Dec 22 with 98 trips
Variance: 39.7x
In [17]:
# Heatmap
heatmap_data = trips.groupby(['day_name', 'hour']).size().reset_index(name='count')
days_ordered = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
heatmap_pivot = heatmap_data.pivot_table(index='hour', columns='day_name', values='count', fill_value=0)
heatmap_pivot = heatmap_pivot.reindex(columns=days_ordered, fill_value=0)

print(f"Heatmap: {heatmap_pivot.shape}")
heatmap_pivot.head(10)
Heatmap: (24, 7)
Out[17]:
day_name Monday Tuesday Wednesday Thursday Friday Saturday Sunday
hour
0 904.0 734.0 872.0 916.0 1544.0 2266.0 2402.0
1 562.0 404.0 474.0 507.0 1028.0 1980.0 1945.0
2 305.0 289.0 275.0 254.0 613.0 1288.0 1331.0
3 178.0 196.0 143.0 185.0 272.0 586.0 618.0
4 168.0 178.0 143.0 156.0 176.0 235.0 263.0
5 303.0 427.0 335.0 394.0 256.0 195.0 200.0
6 899.0 1059.0 894.0 906.0 884.0 506.0 469.0
7 2636.0 2941.0 2833.0 2534.0 2032.0 890.0 843.0
8 3376.0 3542.0 3858.0 3169.0 3071.0 1403.0 1302.0
9 3488.0 4335.0 4134.0 3633.0 3608.0 2410.0 2138.0

5. Demographics¶

In [18]:
# Top 10 stations
top_stations = trips.groupby('Start Station Name').size().nlargest(10).index
demo_data = trips[trips['Start Station Name'].isin(top_stations)].groupby(['Start Station Name', 'Rider Type']).size().unstack(fill_value=0)
demo_data
Out[18]:
Rider Type CASUAL MEMBER
Start Station Name
Allequippa St & Darragh St 137 16607
Atwood St & Bates St 303 33692
Boulevard of the Allies & Parkview Ave 310 29228
Coltart Ave & Forbes Ave 407 26554
Fifth Ave & S Bouquet St 413 24998
N Dithridge St & Centre Ave 374 32472
O'Hara St & University Place 287 27018
S Bouquet Ave & Sennott St 630 56314
Schenley Dr & Schenley Dr Ext 915 18722
Zulema St & Coltart Ave 277 25803
In [19]:
# Overall rider type
rider_totals = trips['Rider Type'].value_counts()
rider_pct = trips['Rider Type'].value_counts(normalize=True) * 100
for r in rider_totals.index:
    print(f"{r}: {rider_totals[r]:,} ({rider_pct[r]:.1f}%)")
MEMBER: 517,200 (92.9%)
CASUAL: 39,315 (7.1%)

6. Station Behavioral Profiling¶

In [20]:
# Station × archetype percentages
station_totals = trips_clean.groupby('Start Station Name').size().reset_index(name='total_trips')
station_arch = trips_clean.groupby(['Start Station Name', 'archetype_label']).size().reset_index(name='arch_trips')
station_arch_pct = station_arch.merge(station_totals, on='Start Station Name')
station_arch_pct['pct'] = (station_arch_pct['arch_trips'] / station_arch_pct['total_trips']) * 100
station_arch_pct = station_arch_pct[station_arch_pct['total_trips'] >= 50]

station_arch_pct.head(15)
Out[20]:
Start Station Name archetype_label arch_trips total_trips pct
0 10th St & Penn Ave Commuter 2732 6659 41.027181
1 10th St & Penn Ave Errand 1541 6659 23.141613
2 10th St & Penn Ave Last-Mile 1682 6659 25.259048
3 10th St & Penn Ave Leisure 704 6659 10.572158
4 17th St & Penn Ave Commuter 2271 5567 40.793964
5 17th St & Penn Ave Errand 1795 5567 32.243578
6 17th St & Penn Ave Last-Mile 951 5567 17.082809
7 17th St & Penn Ave Leisure 550 5567 9.879648
8 21st St & Penn Ave Commuter 1896 6095 31.107465
9 21st St & Penn Ave Errand 2120 6095 34.782609
10 21st St & Penn Ave Last-Mile 1492 6095 24.479081
11 21st St & Penn Ave Leisure 587 6095 9.630845
12 42nd St & Butler St Commuter 1121 4419 25.367730
13 42nd St & Butler St Errand 2139 4419 48.404616
14 42nd St & Butler St Last-Mile 783 4419 17.718941
In [21]:
# Top 3 per archetype
station_archetype_top = {}
for archetype in ['Commuter', 'Last-Mile', 'Errand', 'Leisure']:
    top_3 = station_arch_pct[station_arch_pct['archetype_label'] == archetype].nlargest(3, 'pct')
    station_archetype_top[archetype] = {
        'stations': top_3['Start Station Name'].tolist(),
        'percentages': top_3['pct'].round(1).tolist(),
        'trip_counts': top_3['arch_trips'].astype(int).tolist(),
        'total_trips': top_3['total_trips'].astype(int).tolist()
    }
    print(f"\n{archetype}:")
    for i, station in enumerate(top_3['Start Station Name']):
        print(f"  {i+1}. {station}: {top_3.iloc[i]['pct']:.1f}%")
Commuter:
  1. Schenley Dr & Schenley Dr Ext: 64.8%
  2. Forbes Ave & Schenley Dr: 63.3%
  3. Forbes Ave at TCS Hall (CMU Campus): 61.2%

Last-Mile:
  1. Boulevard of the Allies & Parkview Ave: 46.9%
  2. S Millvale Ave & Centre Ave: 43.4%
  3. Allequippa St & Darragh St: 42.0%

Errand:
  1. Wilkinsburg Park & Ride: 68.4%
  2. Second Ave & Tecumseh St: 61.4%
  3. Shady Ave & Ellsworth Ave: 56.5%

Leisure:
  1. South Side Trail & S 4th St: 19.0%
  2. Liberty Ave & Stanwix St: 18.9%
  3. W Ohio St & Brighton Rd: 18.3%

7. Geographic Visualization¶

Interactive map showing station locations and trip volumes.

In [22]:
# Load bus stop data
bus_stops = pd.read_csv('dataset/PRT_Bus_Stop_Usage_Unweighted_-5413281589865626035.csv', encoding='utf-8-sig')
# Clean duplicates as in ETL
bus_stops = bus_stops.drop_duplicates(subset='stop_id', keep='first')
print(f"Loaded {len(bus_stops)} unique bus stops")

# Calculate trip volumes per bike station
station_volumes = trips.groupby('Start Station Name').size().reset_index(name='total_trips')
station_data = stations.merge(station_volumes, left_on='Name', right_on='Start Station Name', how='left')
station_data['total_trips'] = station_data['total_trips'].fillna(0).astype(int)

print(f"Bike stations with trip data: {len(station_data[station_data['total_trips'] > 0])}")
station_data[station_data['total_trips'] > 0].head()
Loaded 7075 unique bus stops
Bike stations with trip data: 60
Out[22]:
Id Name Total Docks Latitude Longitude Start Station Name total_trips
0 1 Pierce St & Summerlea St 19 40.456507 -79.932331 Pierce St & Summerlea St 4766
1 2 Eliza Furnace Trail & Swineburne St 15 40.425800 -79.953400 Eliza Furnace Trail & Swineburne St 4516
2 3 Centre Ave & Addison St 15 40.444600 -79.978300 Centre Ave & Addison St 1238
3 4 Burns White Center at 3 Crossings 15 40.456400 -79.980000 Burns White Center at 3 Crossings 5077
4 5 Allegheny Station 19 40.448301 -80.018007 Allegheny Station 2849
In [23]:
# Create interactive Folium map
output_dir = Path('./static_viz')
output_dir.mkdir(exist_ok=True)

pgh_center = [40.4406, -79.9959]
m = folium.Map(location=pgh_center, zoom_start=12, tiles='CartoDB positron') # Zoom out slightly for system view

# Helper function to calculate distance (Haversine)
def calculate_distance(lat1, lon1, lat2, lon2):
    R = 6371000  # Earth radius in meters
    lat1, lon1, lat2, lon2 = map(np.radians, [lat1, lon1, lat2, lon2])
    dlat = lat2 - lat1
    dlon = lon2 - lon1
    a = np.sin(dlat/2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2)**2
    c = 2 * np.arcsin(np.sqrt(a))
    return R * c

# Get active bike stations with coordinates
active_bike_stations = station_data[station_data['total_trips'] > 0][['Latitude', 'Longitude']].values

# Check which bus stops are within 400m of any bike station
def is_near_bike_station(bus_lat, bus_lon, threshold=400):
    for bike_lat, bike_lon in active_bike_stations:
        distance = calculate_distance(bus_lat, bus_lon, bike_lat, bike_lon)
        if distance <= threshold:
            return True
    return False

# Count integrated bus stops
integrated_count = 0
total_bus_stops = 0

# Add bike stations (Green Squares - matched to interactive dash)
# We use green CircleMarkers to approximate the "Green" style
for _, row in station_data.iterrows():
    if row['total_trips'] > 0:
        # Let's use Green (#00B884) for all POGOH stations to match dashboard
        folium.CircleMarker(
            location=[row['Latitude'], row['Longitude']],
            radius=6, # Standard size
            popup=f"<b>🚴 {row['Name']}</b><br>{row['total_trips']:,} trips",
            tooltip=row['Name'],
            color='#000000', # Black border
            fill=True,
            fillColor='#00B884', # Green fill
            fillOpacity=1.0,
            weight=2
        ).add_to(m)

# Add bus stops (Red if within 400m, Gray otherwise)
for _, row in bus_stops.iterrows():
    if pd.notna(row['latitude']) and pd.notna(row['longitude']):
        total_bus_stops += 1
        near_bike = is_near_bike_station(row['latitude'], row['longitude'])
        
        if near_bike:
            integrated_count += 1
            color = '#FF2B4C'  # Red - Integrated
            fill_color = '#FF2B4C'
            opacity = 0.8
            fill_opacity = 0.8
            label = 'Integrated (within 400m)'
            radius = 2 # Small dot
        else:
            color = '#999999'  # Gray - Not integrated
            fill_color = '#999999'
            opacity = 0.5
            fill_opacity = 0.5
            label = 'Not integrated'
            radius = 2 # Small dot
        
        folium.CircleMarker(
            location=[row['latitude'], row['longitude']],
            radius=radius,
            popup=f"<b>🚌 {row['stop_name']}</b><br>Bus Stop<br><i>{label}</i>",
            tooltip=row['stop_name'],
            color=color,
            fill=True,
            fillColor=fill_color,
            fillOpacity=fill_opacity,
            weight=0
        ).add_to(m)

# Add legend
legend_html = f'''
<div style="position: fixed; bottom: 50px; right: 50px; width: 250px; background-color: white; 
     border:2px solid black; z-index:9999; font-size:12px; padding: 15px; font-family: monospace;">
<p style="font-weight:bold; border-bottom:1px solid #ccc; margin-top:0;">MAP KEY</p>
<p><span style="color:#00B884; font-size:14px;">●</span> POGOH Station</p>
<p><span style="color:#FF2B4C; font-size:14px;">●</span> Bus Stop (Nearby POGOH)</p>
<p><span style="color:#999999; font-size:14px;">●</span> Bus Stop (Not Nearby)</p>
<hr>
<p style="font-size:11px; margin-bottom:0;">
<b>System Integration:</b><br>
{integrated_count:,} / {total_bus_stops:,} stops<br>
({integrated_count/total_bus_stops*100:.1f}% within 400m)
</p>
</div>
'''
m.get_root().html.add_child(folium.Element(legend_html))

# Save and display
m.save(output_dir / 'fig9_station_map.html')
print(f"✓ Saved fig9_station_map.html")
print(f"  • {len(station_data[station_data['total_trips'] > 0])} bike stations")
print(f"  • {total_bus_stops:,} bus stops")
print(f"  • {integrated_count:,} integrated bus stops (within 400m)")
print(f"  • Integration rate: {integrated_count/total_bus_stops*100:.1f}%")

# Display in notebook
m
✓ Saved fig9_station_map.html
  • 60 bike stations
  • 7,075 bus stops
  • 817 integrated bus stops (within 400m)
  • Integration rate: 11.5%
Out[23]:
Make this Notebook Trusted to load map: File -> Trust Notebook

System Insight: This map visualizes the full Pittsburgh transit network (7,000+ bus stops).

  • Red Dots: Bus stops within 400m (5-min walk) of a POGOH station.
  • Gray Dots: Bus stops outside the bikeshare catchment area.

Only 11.5% of bus stops are currently integrated with bikeshare, clustering heavily in Oakland and East Liberty. The vast gray sea represents the "Last Mile" opportunity gap.

8. Static Visualization Generation¶

Now create publication-quality charts using Seaborn and Bokeh.

output_dir = Path('./static_viz') output_dir.mkdir(exist_ok=True) print(f"✓ Output: {output_dir.absolute()}")

In [24]:
df_arch = pd.DataFrame(archetype_labels)
df_arch['pct'] = (df_arch['count'] / df_arch['count'].sum()) * 100
df_arch = df_arch.sort_values('count', ascending=True)

fig, ax = plt.subplots(figsize=(10, 5))
colors = ['#2B4CFF', '#34C759', '#FF9500', '#FF2B8C']
ax.barh(df_arch['label'], df_arch['count'], color=colors, edgecolor='black', linewidth=1.5)
ax.set_title('FIG 2: Trip Archetypes (K-Means)', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Trips', fontsize=12, fontweight='bold')
ax.grid(axis='x', alpha=0.3)

for i, (_, row) in enumerate(df_arch.iterrows()):
    ax.text(row['count'] + 5000, i, f"{row['pct']:.1f}%", va='center', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'fig2_archetypes.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig2_archetypes.png")
No description has been provided for this image
✓ Saved fig2_archetypes.png

Finding: Commuter (47.9%), Last-Mile (32.8%), Errand (15.7%), Leisure (3.6%)

FIG 3: Rider Type¶

In [25]:
df_demo = pd.DataFrame({'Type': rider_totals.index, 'Trips': rider_totals.values, 'Pct': rider_pct.values})

fig, ax = plt.subplots(figsize=(8, 5))
ax.bar(df_demo['Type'], df_demo['Trips'], color=['#FF9500', '#2B4CFF'], edgecolor='black', linewidth=2, width=0.6)
ax.set_title('FIG 3: Rider Type', fontsize=16, fontweight='bold', pad=20)
ax.set_ylabel('Trips', fontsize=12, fontweight='bold')
ax.grid(axis='y', alpha=0.3)

for i, (_, row) in enumerate(df_demo.iterrows()):
    ax.text(i, row['Trips'] + 5000, f"{int(row['Trips']):,}\n({row['Pct']:.1f}%)", ha='center', va='bottom', fontsize=11, fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'fig3_demographics.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig3_demographics.png")
No description has been provided for this image
✓ Saved fig3_demographics.png

Finding: 98.7% MEMBER (utilitarian system)

FIG 4: Hourly Pattern¶

In [26]:
hourly = trips.groupby('hour').size().reset_index(name='trips')
hourly.head(10)
Out[26]:
hour trips
0 0 9638
1 1 6900
2 2 4355
3 3 2178
4 4 1319
5 5 2110
6 6 5617
7 7 14709
8 8 19721
9 9 23746
In [27]:
hourly = trips.groupby('hour').size().reset_index(name='trips')

fig, ax = plt.subplots(figsize=(14, 5))
ax.plot(hourly['hour'], hourly['trips'], linewidth=3, color='#2B4CFF', alpha=0.8, marker='o', markersize=6)
ax.fill_between(hourly['hour'], hourly['trips'], alpha=0.2, color='#2B4CFF')

ax.set_title('FIG 4: Hourly Trip Distribution (24-Hour Pattern)', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Hour of Day', fontsize=12, fontweight='bold')
ax.set_ylabel('Total Trips', fontsize=12, fontweight='bold')
ax.set_xticks(range(0, 24, 2))
ax.grid(alpha=0.3, linestyle='--')

# Annotate peaks
peak_hour = hourly.loc[hourly['trips'].idxmax()]
ax.annotate(f"Peak: {int(peak_hour['trips']):,} trips\nat {int(peak_hour['hour'])}:00",
            xy=(peak_hour['hour'], peak_hour['trips']), 
            xytext=(peak_hour['hour']+2, peak_hour['trips']+2000),
            fontsize=10, bbox=dict(boxstyle='round', facecolor='yellow', alpha=0.7),
            arrowprops=dict(arrowstyle='->', color='black'))

plt.tight_layout()
plt.savefig(output_dir / 'fig4_hourly.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig4_hourly.png")
No description has been provided for this image
✓ Saved fig4_hourly.png

Finding: Bimodal (8-9 AM, 5 PM peak)

FIG 5: Heatmap¶

In [28]:
fig, ax = plt.subplots(figsize=(14, 6))
sns.heatmap(heatmap_pivot, cmap='YlOrRd', annot=False, cbar_kws={'label': 'Trips'}, linewidths=0.5, ax=ax)
ax.set_title('FIG 5: Day × Hour Heatmap', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Day', fontsize=12, fontweight='bold')
ax.set_ylabel('Hour', fontsize=12, fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'fig5_heatmap.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig5_heatmap.png")
No description has been provided for this image
✓ Saved fig5_heatmap.png

Finding: Weekend cooling (30-40% lower Sat/Sun)

FIG 6: Station Archetypes (2×2)¶

In [29]:
fig, axes = plt.subplots(2, 2, figsize=(16, 10))
fig.suptitle('FIG 6: Behavioral Hotspots', fontsize=18, fontweight='bold', y=0.995)

configs = [
    ('Commuter', '🚴 COMMUTER', '#2B4CFF', axes[0,0]),
    ('Last-Mile', '🔗 LAST-MILE', '#FF9500', axes[0,1]),
    ('Errand', '🛒 ERRAND', '#34C759', axes[1,0]),
    ('Leisure', '🎨 LEISURE', '#FF2B8C', axes[1,1])
]

for arch, title, color, ax in configs:
    data = station_archetype_top[arch]
    labels = [s[:35]+'...' if len(s)>35 else s for s in data['stations']]
    ax.barh(labels, data['percentages'], color=color, edgecolor='black', linewidth=2)
    ax.set_title(title, fontsize=13, fontweight='bold', pad=10)
    ax.set_xlabel('% of Trips', fontsize=11, fontweight='bold')
    ax.set_xlim(0, 75)
    ax.grid(axis='x', alpha=0.3)
    ax.invert_yaxis()
    for i, (pct, trips, total) in enumerate(zip(data['percentages'], data['trip_counts'], data['total_trips'])):
        ax.text(pct+1, i, f"{pct:.1f}% ({trips:,}/{total:,})", va='center', fontsize=9, fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'fig6_station_archetypes.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig6_station_archetypes.png")
No description has been provided for this image
✓ Saved fig6_station_archetypes.png

Finding: Schenley Dr (64.8% commuter), Wilkinsburg P&R (68.4% errand)

FIG 7: Archetype Comparison¶

In [30]:
fig, ax = plt.subplots(figsize=(10, 6))

# Use the same data as before
colors = ['#2B4CFF', '#34C759', '#FF9500', '#FF2B8C']
bars = ax.bar(df_arch['label'], df_arch['count'], color=colors, edgecolor='black', linewidth=2, width=0.6)

ax.set_title('FIG 7: Trip Archetype Distribution', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Trip Archetype', fontsize=12, fontweight='bold')
ax.set_ylabel('Number of Trips', fontsize=12, fontweight='bold')
ax.grid(axis='y', alpha=0.3, linestyle='--')

# Add value labels on top of bars
for i, (_, row) in enumerate(df_arch.iterrows()):
    ax.text(i, row['count'] + 5000, f"{row['count']:,}\n({row['pct']:.1f}%)", 
            ha='center', va='bottom', fontsize=10, fontweight='bold')

plt.tight_layout()
plt.savefig(output_dir / 'fig7_archetypes.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig7_archetypes.png")
No description has been provided for this image
✓ Saved fig7_archetypes.png

FIG 8: Top Stations by Rider Type¶

In [31]:
df_stations = demo_data.reset_index()
df_stations['Total'] = df_stations['CASUAL'] + df_stations['MEMBER']
df_stations = df_stations.sort_values('Total', ascending=True)
df_stations['Short'] = df_stations['Start Station Name'].apply(lambda x: x[:30]+'...' if len(x)>30 else x)

fig, ax = plt.subplots(figsize=(12, 8))
x = np.arange(len(df_stations))
width = 0.35

ax.barh(x - width/2, df_stations['CASUAL'], width, label='Casual', color='#FF9500', edgecolor='black')
ax.barh(x + width/2, df_stations['MEMBER'], width, label='Member', color='#2B4CFF', edgecolor='black')

ax.set_title('FIG 8: Top 10 Stations by Rider Type', fontsize=16, fontweight='bold', pad=20)
ax.set_xlabel('Trips', fontsize=12, fontweight='bold')
ax.set_yticks(x)
ax.set_yticklabels(df_stations['Short'], fontsize=9)
ax.legend(loc='lower right', fontsize=11)
ax.grid(axis='x', alpha=0.3)

plt.tight_layout()
plt.savefig(output_dir / 'fig8_top_stations.png', dpi=150, bbox_inches='tight')
plt.show()
print("✓ Saved fig8_top_stations.png")
No description has been provided for this image
✓ Saved fig8_top_stations.png

Finding: All top stations are 98%+ MEMBER

Summary¶

Key Findings:

  1. 40× variance in daily ridership (98 → 3,892 trips) - academic calendar dependency
  2. 68.1% Campus Corridor trips (CMU/Pitt concentration)
  3. 92.9% MEMBER trips (utilitarian system, not tourist amenity)
  4. 4 behavioral archetypes: Commuter (47.9%), Last-Mile (32.8%), Errand (15.7%), Leisure (3.6%)
  5. Station specialization requires differentiated rebalancing strategies

Visualizations Generated:

  • ✓ FIG 1: Daily timeseries (Seaborn line chart)
  • ✓ FIG 2: Trip archetypes (Seaborn horizontal bars)
  • ✓ FIG 3: Rider demographics (Seaborn bar chart)
  • ✓ FIG 4: Hourly patterns (Seaborn line + area chart)
  • ✓ FIG 5: Day × Hour heatmap (Seaborn)
  • ✓ FIG 6: Station behavioral hotspots (Seaborn 2×2 grid)
  • ✓ FIG 7: Archetype comparison (Seaborn bar chart)
  • ✓ FIG 8: Top stations by rider type (Seaborn grouped bars)
  • ✓ FIG 9: Bike + bus station map (Folium interactive)

Total: 8 Seaborn PNG + 1 Folium HTML map

Interactive Dashboard: https://rzrizaldy.github.io/pgh-transit-atlas/


Footnote: All logic and exploratory finding done by Rizaldy, with help on GenAI (Gemini + Claude) to complete syntaxing.